NSF PAR Search | NSF Public Access Repository

Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought

Man, Yunze; Huang, De-An; Liu, Guilin; Sheng, Shiwei; Liu, Shilong; Gui, Liang-Yan; Kautz, Jan; Wang, Yu-Xiong; Yu, Zhiding (June 2025, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))

Free, publicly-accessible full text available June 11, 2026
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation

Zhao, Yue; Xue, Fuzhao; Reed, Scott; Fan, Linxi; Zhu, Yuke; Kautz, Jan; Yu, Zhiding; Krähenbühl, Philipp; Huang, De-An (February 2025, cs.CV)

We introduce Quantized Language-Image Pretraining (QLIP), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We are the first to show that the two objectives do not need to be at odds. We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively mixes the large-batch requirements of image-language pre-training with the memory bottleneck imposed by the reconstruction objective. We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model. Specifically, QLIP serves as a drop-in replacement for the visual encoder for LLaVA and the image tokenizer for LlamaGen with comparable or even better performance. Finally, we demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
Free, publicly-accessible full text available February 7, 2026
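The QLIP abstract above names binary spherical quantization as the basis of its autoencoder but gives no further detail. As a rough illustration only, the following sketch shows the general idea as it is commonly described: normalize a latent vector onto the unit sphere, then snap each coordinate to plus or minus 1/sqrt(d). The function name `bsq_quantize` and the bit-packing order are assumptions for illustration, not taken from the paper or its code.

```python
import math


def bsq_quantize(v):
    """Hypothetical helper: binary spherical quantization of one latent vector.

    Returns the quantized code (a point on the unit sphere whose coordinates
    are all +-1/sqrt(d)) and an integer token index built from the sign bits.
    """
    d = len(v)
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    # Step 1: project onto the unit sphere.
    u = [x / norm for x in v]
    # Step 2: keep only the sign of each coordinate (one bit per dimension).
    bits = [1 if x > 0 else 0 for x in u]
    # Step 3: rescale so the quantized code also has unit length.
    scale = 1.0 / math.sqrt(d)
    code = [scale if b else -scale for b in bits]
    # Pack the bits into an integer token id (bit i = dimension i; assumed order).
    index = sum(b << i for i, b in enumerate(bits))
    return code, index


code, index = bsq_quantize([3.0, -4.0, 0.5, -0.1])
```

With d latent dimensions this yields an implicit codebook of 2^d entries without storing any codebook vectors, which is one reason the approach is attractive for tokenizers.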
EAGLE: Exploring the Design Space for Multi-modal LLMs with Mixture of Encoders

Shi, Min; Liu, Fuxiao; Wang, Shihao; Liao, Shijia; Radhakrishnan, Subhashree; Zhao, Yilin; Huang, De-An; Yin, Hongxu; Sapra, Karan; Yacoob, Yaser; et al. (April 2025, ICLR 2025)

The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, there is a lack of systematic comparisons and detailed ablation studies addressing critical aspects, such as expert selection and the integration of multiple vision experts. This study provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
Free, publicly-accessible full text available April 24, 2026
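The Eagle abstract above reports that simply concatenating visual tokens from complementary vision encoders works as well as more complex mixing schemes. A minimal sketch of what such a concatenation could look like follows. It assumes that every encoder yields the same number of tokens and that the features are joined along the channel axis; the abstract does not state the axis, and the function name `concat_visual_tokens` is hypothetical.

```python
def concat_visual_tokens(encoder_outputs):
    """Hypothetical helper: fuse per-token features from several encoders.

    encoder_outputs is a list with one entry per encoder; each entry is a
    list of tokens, and each token is a list of floats. Tokens at the same
    position are joined along the channel (feature) axis.
    """
    num_tokens = len(encoder_outputs[0])
    if any(len(tokens) != num_tokens for tokens in encoder_outputs):
        raise ValueError("all encoders must yield the same number of tokens")
    fused = []
    for i in range(num_tokens):
        token = []
        for tokens in encoder_outputs:
            token.extend(tokens[i])  # append this encoder's channels
        fused.append(token)
    return fused


encoder_a = [[1.0, 2.0], [3.0, 4.0]]  # 2 tokens, 2 channels each
encoder_b = [[5.0], [6.0]]            # 2 tokens, 1 channel each
fused = concat_visual_tokens([encoder_a, encoder_b])
```

The token count stays fixed while the channel width grows with each added encoder, so the language model's sequence length is unaffected under this assumption.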
SSCBench: A Large-Scale 3D Semantic Scene Completion Benchmark for Autonomous Driving

https://doi.org/10.1109/IROS58592.2024.10802143

Li, Yiming; Li, Sihang; Liu, Xinhao; Gong, Moonjun; Li, Kenan; Chen, Nuo; Wang, Zijun; Li, Zhiheng; Jiang, Tao; Yu, Fisher; et al (October 2024, IEEE)

Full Text Available
Delving Deeper into Anti-Aliasing in ConvNets

https://doi.org/10.1007/s11263-022-01672-y

Zou, Xueyan; Xiao, Fanyi; Yu, Zhiding; Li, Yuheng; Lee, Yong Jae (January 2022, International Journal of Computer Vision)

Aliasing refers to the phenomenon that high frequency signals degenerate into completely different ones after sampling. It arises as a problem in the context of deep learning as downsampling layers are widely adopted in deep architectures to reduce parameters and computation. The standard solution is to apply a lowpass filter (e.g., Gaussian blur) before downsampling. However, it can be suboptimal to apply the same filter across the entire content, as the frequency of feature maps can vary across both spatial locations and feature channels. To tackle this, we propose an adaptive content-aware low-pass filtering layer, which predicts separate filter weights for each spatial location and channel group of the input feature maps. We investigate the effectiveness and generalization of the proposed method across multiple tasks, including image classification, semantic segmentation, instance segmentation, video instance segmentation, and image-to-image translation. Both qualitative and quantitative results demonstrate that our approach effectively adapts to the different feature frequencies to avoid aliasing while preserving useful information for recognition. Code is available at https://maureenzou.github.io/ddac/
Full Text Available
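The anti-aliasing abstract above describes the standard remedy of applying a low-pass filter before downsampling. The sketch below illustrates only that fixed-filter baseline on a 1D signal, not the adaptive content-aware layer the paper proposes. The kernel values, the edge-replication padding, and the function names are assumptions chosen for illustration.

```python
def blur(signal, kernel=(0.25, 0.5, 0.25)):
    """Apply a small fixed low-pass filter, replicating the edge samples."""
    n = len(signal)
    half = len(kernel) // 2
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - half, 0), n - 1)  # clamp index at the borders
            acc += w * signal[j]
        out.append(acc)
    return out


def downsample(signal, stride=2):
    """Keep every stride-th sample."""
    return signal[::stride]


# The highest-frequency signal a stride-2 sampler cannot represent.
signal = [1.0, 0.0] * 4
naive = downsample(signal)           # aliasing: the oscillation looks constant
filtered = downsample(blur(signal))  # blur first, then subsample
```

Here the naive result is a constant run of 1.0, which misrepresents the input, while the filtered result settles near the signal's mean of 0.5 away from the left border.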
Delving Deeper into Anti-aliasing in ConvNets

Zou, Xueyan; Xiao, Fanyi; Yu, Zhiding; Lee, Yong Jae (January 2020, BMVC)
Full Text Available
Regularizing Neural Networks via Minimizing Hyperspherical Energy

Lin, Rongmei; Liu, Weiyang; Liu, Zhen; Feng, Chen; Yu, Zhiding; Rehg, James M.; Xiong, Li; Song, Le (June 2020, IEEE Conference on Computer Vision and Pattern Recognition)

Inspired by the Thomson problem in physics where the distribution of mutually repelling electrons on a unit sphere can be modeled via minimizing some potential energy, hyperspherical energy minimization has demonstrated its potential in regularizing neural networks and improving their generalization power. In this paper, we first study the important role that hyperspherical energy plays in neural network training by analyzing its training dynamics. Then we show that naively minimizing hyperspherical energy suffers from some difficulties due to highly non-linear and non-convex optimization as the space dimensionality becomes higher, therefore limiting the potential to further improve the generalization. To address these problems, we propose the compressive minimum hyperspherical energy (CoMHE) as a more effective regularization for neural networks. Specifically, CoMHE utilizes projection mappings to reduce the dimensionality of neurons and minimizes their hyperspherical energy. According to different designs for the projection mapping, we propose several distinct yet well-performing variants and provide some theoretical guarantees to justify their effectiveness. Our experiments show that CoMHE consistently outperforms existing regularization methods, and can be easily applied to different neural networks.
Full Text Available
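The CoMHE abstract above describes projecting neurons to a lower dimension and then minimizing their hyperspherical energy. As a rough sketch under stated assumptions, the code below computes a Riesz-style energy (the sum of inverse pairwise distances between normalized neurons, raised to a power s) before and after a linear projection. The exact energy form, the choice of s, the projection matrix, and all function names are illustrative assumptions rather than the paper's implementation.

```python
import math


def normalize(v):
    """Scale a nonzero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def hyperspherical_energy(neurons, s=1.0):
    """Sum of distance**(-s) over unordered pairs of normalized neurons.

    Assumes no two neurons point in exactly the same direction; otherwise
    the distance is zero and the energy is undefined.
    """
    unit = [normalize(w) for w in neurons]
    energy = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            energy += math.dist(unit[i], unit[j]) ** (-s)
    return energy


def project(neurons, matrix):
    """Map each neuron to a lower dimension with a linear projection."""
    return [[sum(r * x for r, x in zip(row, w)) for row in matrix]
            for w in neurons]


antipodal = hyperspherical_energy([[1.0, 0.0], [-1.0, 0.0]])
orthogonal = hyperspherical_energy([[1.0, 0.0], [0.0, 1.0]])

# Hypothetical projection that drops the third coordinate.
drop_last = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
projected = hyperspherical_energy(
    project([[2.0, 0.0, 5.0], [0.0, 3.0, 5.0]], drop_last))
```

Neurons that are spread further apart on the sphere give lower energy, so using this quantity as a regularizer pushes neuron directions toward a more uniform arrangement.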
Instance-Aware, Context-Focused, and Memory-Efficient Weakly Supervised Object Detection

https://doi.org/10.1109/CVPR42600.2020.01061

Ren, Zhongzheng; Yu, Zhiding; Yang, Xiaodong; Liu, Ming-Yu; Lee, Yong Jae; Schwing, Alexander G.; Kautz, Jan (June 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
Full Text Available
